Method 

Genome of the pathogen Porphyromonas gingivalis 
recovered from a biofilm in a hospital sink using 
a high-throughput single-cell genomics platform 
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Although biofilms have been shown to be reservoirs of pathogens, our knowledge of the microbial diversity in biofilms 
within critical areas, such as health care facilities, is limited. Available methods for pathogen identification and strain 
typing have some inherent restrictions. In particular, culturing will yield only a fraction of the species present, PCR of 
virulence or marker genes is mainly focused on a handful of known species, and shotgun metagenomics is limited in the 
ability to detect strain variations. In this study, we present a single-cell genome sequencing approach to address these 
limitations and demonstrate it by specifically targeting bacterial cells within a complex biofilm from a hospital bathroom 
sink drain. A newly developed, automated platform was used to generate genomic DNA by the multiple displacement 
amplification (MDA) technique from hundreds of single cells in parallel. MDA reactions were screened and classified by 
16S rRNA gene PCR sequence, which revealed a broad range of bacteria covering 25 different genera representing en- 
vironmental species, human commensals, and opportunistic human pathogens. Here we focus on the recovery of a nearly 
complete genome representing a novel strain of the periodontal pathogen Porphyromonas gingivalis [P. gingivalis ]CV\ SC001) 
using the single-cell assembly tool SPAdes. Single-cell genomics is becoming an accepted method to capture novel genomes, 
primarily in the marine and soil environments. Here we show for the first time that it also enables comparative genomic 
analysis of strain variation in a pathogen captured from complex biofilm samples in a healthcare facility. 

[Supplemental material is available for this article.] 



Ongoing efforts to understand the genomic diversity of microbes 
in nature and in human health are hampered by the limited 
availability of cultivated organisms and their genomes (The Hu- 
man Microbiome Jumpstart Reference Strains Consortium 2010). 
Only 1%-10% of known bacterial species (Rappe and Giovannoni 
2003) are thought to be currently cultivated, although great 
progress is being made for some bacterial communities; for ex- 
ample, about half of bacterial species within the human oral cavity 
have been cultivated (Dewhirst et al. 2010). The recent advance- 
ments in DNA sequencing of single bacterial cells (Raghunathan 
et al. 2005) have accelerated the discovery of uncultivated mi- 
crobes (Lasken 2012), providing genomic assemblies for species 
previously known only from 16S rRNA clone libraries and meta- 
genomic data (Marcy et al. 2007; Podar et al. 2007; Binga et al. 
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2008; Eloe et al. 2011; Youssef et al. 2011; Dupont et al. 2012). This 
newly developed methodology provides a culture-independent 
approach to capture the genomes of uncultivated organisms, 
which can then be integrated into many intensive genomics- 
based studies. A high-throughput strategy was recently estab- 
lished to sequence and assemble single-cell genomes of bacteria 
(Chitsaz et al. 2011) and viruses (Allen et al. 2011), including 
novel uncultivated bacteria from environmental samples (Chitsaz 
et al. 2011; Eloe et al. 2011; Dupont et al. 2012). The workflow 
consists of (1) delivery of single bacterial cells into 384-well 
microtiter wells by fluorescence activated cell sorting (FACS); (2) 
use of a robotic platform to perform 384-well automated cell lysis 
and amplification of DNA by the multiple displacement ampli- 
fication (MDA) method (Dean et al. 2001, 2002; Hosono et al. 
2003) to create libraries of genomic DNA derived from single 
cells; (3) PCR and cycle sequencing of 16S rRNA genes to profile 
the taxonomy and diversity of the libraries; (4) selection of 
candidate amplified genomes for whole-genome sequencing; 
and (5) sequencing and assembly of selected genomes using as- 
sembly tools designed specifically for MDA-amplified single cells 
(Chitsaz et al. 2011; Bankevich et al. 2012). A highly integrated 
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robotic platform, described in this study for the first time, was 
used to increase the throughput, ease, and overall cost of pro- 
cessing single cells. 

Here we have focused this approach on the indoor environ- 
ment. Despite the fact that a typical person spends —90% of their 
time indoors (Klepeis et al. 2001), there is little known about the 
microbial diversity of this environment. Of particular interest is 
the prevalence of species affecting human health, including both 
opportunistic and primary pathogens. Recent studies of indoor 
environments using culture-independent molecular methods in- 
dicate an unexpectedly high bacterial diversity on surfaces within 
daycare facilities and public bathroom facilities (Lee et al. 2007; 
Flores et al. 2011), where the majority of organisms in the latter 
environment were human associated (Flores et al. 2011). Another 
study shows that bacterial diversity is lower in indoor air at 
a healthcare facility compared with outdoor air; however, the in- 
door air contained a higher number of potential human pathogens 
as shown by 16S rRNA gene sequence analyses (Kembel et al. 
2012). Biofilms in particular are thought to be reservoirs of disease- 
causing organisms in both outdoor and indoor environments. 
Several pathogens, including Escherichia coli, Vibrio cholerae (Shikuma 
and Hadfield 2010), and Helicobacter pylori (Percival and Thomas 
2009; Linke et al. 2010), have been detected in biofilms within 
water distribution systems. In addition, the long-term persistence 
of Legionella pneumophila, the causative agent of Legionnaire's 
disease, in biofilms within natural and human-impacted fresh- 
water environments is well known (Walker et al. 1993; Murga 
et al. 2001; Declerck 2010; Giao et al. 2011). Recent 16S rRNA 
gene molecular surveys have revealed a significant load of 
Mycobacterium avium in showerhead biofilms (Feazel et al. 2009), 
and studies on biofilms growing on shower curtains suggest that 
these communities also harbor potential opportunistic pathogens 
that can threaten immune-compromised patients (Kelley et al. 
2004). In another study, the source of a deadly outbreak of a mul- 
tidrug-resistant strain of Pseudomonas aeruginosa was traced to 
biofilms in hand hygiene sink drains, where its viable cells could be 
identified (Hota et al. 2009). 

There is great interest, therefore, to investigate biofilms as 
reservoirs of pathogens at higher resolution than allowed by the 
most commonly used detection and identification methodologies. 
Culture-independent surveys using the 16S rRNA gene as a marker 
are currently the most widely used approach; however, genetic 
strain differences reflecting pathogenicity are often difficult to re- 
solve due to this gene being highly conserved among many bacterial 
strains. Quantitative PCR and direct culturing are focused on either 
a handful of predetermined pathogens or what can be readily cul- 
tivated. Metagenomic surveys are becoming common, but so far, 
our ability is limited to accurately predicting taxonomic affiliation 
at species or strain levels from highly diverse and complex data sets. 
Additionally, a whole-genome comparative genomic study on the 
evolution and transmission of a pathogen requires substantial 
amounts of DNA or a cultured strain, which often cannot be 
obtained. It has been demonstrated in a controlled experiment with 
10 pg of extracted DNA provided as a template that MDA-amplihed 
genotyping call and accuracy rates were only slightly lower than 
those for genomic DNA isolated directly from cultured cells (Giardina 
et al. 2009). Using single-cell genomic approaches, partial to near 
complete genomes should be obtainable without cultivation, from 
difficult samples within critical indoor environments such as 
healthcare facilities. In-depth analyses of these genomic data can 
then provide accurate and detailed information of strain-specific 
pathogen-gene signatures and other virulence factors. 



The aim of this study was to investigate for the first time the 
bacteria present in a healthcare facility with a high-throughput 
single-cell genomics approach. Based on the known prevalence of 
pathogens in biofilms, we focused on a sink drain biofilm from 
a public restroom adjoining an emergency waiting room. Se- 
quencing 16S rRNA genes PCR-amplified from 416 single-cell MDA 
reactions, we found 18 candidate commensal and potentially 
pathogenic species that were selected for 454 shallow sequencing. 
Initial read mapping and de novo assembly of the low-coverage 
454 sequence data confirmed that we had obtained genomic se- 
quences for the pathogen Streptococcus pneumoniae as well as bac- 
terial species highly similar to and those reported to be potentially 
pathogenic, including Sphingobacterium spiritivorum (Tronel et al. 
2003; Kampfer et al. 2005), Leptotrichia buccalis (Hammann et al. 
1993; Hot et al. 2008), as well as the host-associated oral bacteria, 
Streptococcus mitis and Veillonella parvula. Of particular note, we 
found three MDA products with sequences for the oral pathogen 
Porphyromonas gingivalis, which is a periodontal pathogen in- 
volved in periodontal bone loss that has also been linked to pro- 
gression of atherosclerotic disease (Pussinen et al. 2007; Yilmaz 
2008). P. gingivalis possesses many virulence factors, including 
functions that allow it to survive intracellularly and to be trans- 
mitted between different types of host cells (Li et al. 2008). Despite 
being detected at a very low abundance in the oral cavity, P. gin- 
givalis can strongly disrupt the host-microbial homeostasis 
(Hajishengallis et al. 2011). As with many pathogens, the envi- 
ronmental reservoirs and mode(s) of transmission of P. gingivalis 
are not fully understood, yet it is a globally important pathogen 
with only three sequenced genomes available at the time of this 
report. It was recently stated by a CDC report that nearly 50% of 
American adults have mild, moderate, or severe periodontitis, and 
this percentage rises to 70% in adults greater than age 65 (Eke et al. 
2012). To our knowledge, there are no previous reports detecting 
P. gingivalis outside of a host. 

Three MDA-amplified genomes with 16S rRNA gene se- 
quences identified as P. gingivalis were chosen for additional deep 
sequencing on the Illumina GA IIx platform, and the resulting 
reads were mapped to P. gingivalis genomes. One MDA-read data 
set had —90% sequence coverage to P. gingivalis strain TDC60, 
which was isolated from a patient in Japan with severe peri- 
odontitis (Watanabe et al. 2011). A new single-cell de novo as- 
sembly algorithm, SPAdes (Bankevich et al. 2012), was used to 
generate contigs of the highest-coverage MDA product, which 
produced a 2.35-Mb draft genome (PG JCVI SC001). Comparative 
genomics and pangenome analyses were performed with the 
three other available P. gingivalis genomes; virulent strains W83 
(Nelson et al. 2003) and TDC60 (Watanabe et al. 2011), and the 
less virulent strain ATCC 33277 (Naito et al. 2008). We demonstrate 
that single-cell genomics is a powerful approach that can produce 
highly accurate sequence data, enabling comparative genomic 
studies of pathogens obtained from a complex heterogeneous en- 
vironmental sample. 

Results 

Sampling and sorting cells from biofilms 

Seawater-derived samples contain relatively high bacterial counts 
and were a rich source for sorting single cells by FACS (Chitsaz 
et al. 2011; Dupont et al. 2012) with ~20%-30% of single-cell 
amplifications yielding amplified genomic DNA based on 
sequencing of PCR-amplified 16S rRNA genes (Methods). In 
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contrast, attempts to randomly sort single bacterial cells from 
the indoor environment, such as from surface swabs (data not 
shown) and sink drains, yielded <1% amplified bacterial genomes 
due to failure to lyse cells or due to noise from fluorescence 
background signals of nonbacterial particles during sorting. To 
reduce the background noise of nonbacterial events, which ob- 
scure bacterial cells stained with SYBR Green, the biofilm sam- 
ple was vortexed, filtered through a 5-jxm filter, and processed 
through a Nycodenz cushion (Methods). After processing, the 
percentage of positive stained events that were in the bacterial 
size range approached 20% of the total particle count (Supple- 
mental Fig. SI). This positive gate was used to sort single events 
into 384-well microtiter plates. 

Microtiter plates containing single sorted events were then 
processed on the highly integrated high-throughput single-cell 
platform (Methods; Supplemental Fig. S2). In total, 78 MDAs 
of 416 sorted wells were identified as candidates based on the 
taxonomy of their 16S rRNA gene sequence. This 19% overall 
success rate does not include 16S rRNA gene sequences that can 
be attributed to common bacterial MDA reagent contaminants 
detected in control MDA reactions without a sorted cell (NTC, no 
template control). NTCs were always run in parallel with each sort 
of single cells to determine the relative amount and identity of 
contaminating bacterial DNA in the MDA reagents; this is a nec- 
essary standard practice in single-cell genomics due to the highly 
processive strand displacement activity of the phi29 DNA poly- 
merase and the near-ubiquitous presence of bacterial DNA in 
reagents (Allen et al. 2011; Blainey and Quake 2011; Woyke et al. 
2011). Based on 16SrRNA gene analysis, a wide diversity of genera 
was found in this random sort from the sink biofilm community 
(Fig. 1). 

Screening MDA products using multiplexed 454 sequencing 

A total of 18 candidate MDA products from the sink biofilm sample 
were of interest in terms of their relationship to human health 
being reported as potentially pathogenic or a commensal species. 
These were each sequenced as barcoded libraries on one-fourth of 
a 454 plate. A range of 5500-13,500 reads was obtained for the 18 
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Figure 1. Distribution of 78 candidate 16S rRNA sequences found in 
single-cell sorted wells from the sink biofilm sample. 1 6S rRNA sequences 
from single-cell amplifications observed for this FACS sorted sample. 



libraries, with overall average read length of 321 bp. Complex 
environmental samples, such as a sink biofilm, may be difficult to 
lyse, had poorer amplification, and are more likely to contami- 
nate single-cell MDA reactions with free DNA from other organ- 
isms. The 454 data sets for the 18 candidate MDAs were therefore 
screened for quality, contamination, and overall suitability for 
analysis by several criteria: (1) the presence of reads having BLAST 
matches (NCBI nr database) to the same genera indicated by the 
16S rRNA gene; (2) confirmation of identity for contigs generated 
from de novo assembly; and (3) successful mapping of reads to 
a representative sequenced genome. A total of nine of the original 
18 MDAs met these criteria. These comprise species previously 
reported to be pathogenic or potentially pathogenic, including 
S. pneumoniae (1 MDA product), S. spiritivorum (1), L. buccalis (1), 
P. gingivalis (3), S. mitis (2), and the commensal oral bacteria 
V. parvula (1). These 454 reads were assembled, and the open 
reading frames were annotated to determine the genes captured 
from these species, as well as to assess potential for full genome 
sequencing (Supplemental Table SI). Although many of these 
products were very intriguing, P. gingivalis was chosen for detailed 
analysis because there was more than one MDA representing 
a genome, they passed all the quality criteria, and all three of the 
MDAs contained a high proportion of genomic sequences from 
P. gingivalis (labeled MDA 1, 2, and 3). These were chosen for deep 
sequencing using the Illumina GAIIx platform on one-third of 
a lane each. 

Read-based analyses of the P. gingivalis MDA products 

Although Porphyromonas is a widespread and well recognized 
pathogen, only three P. gingivalis genomes have been sequenced: 
strain W83 (virulent) (Nelson et al. 2003), ATCC 33277 (less 
virulent than W83) (Naito et al. 2008), and the most recent, 
TDC60 (virulent) (Watanabe et al. 2011). A majority of the raw 
100-bp paired-end reads for each of the three MDA products 
yielded BLAST hits to the newly completed genome of a virulent 
P. gingivalis strain (TDC60) from a severe periodontal lesion in 
a Japanese patient (Watanabe and Frommel 1993), confirming 
the 16S rRNA gene PCR sequence matches. The total reads pass- 
ing quality filtering, along with the results of mapping the reads 
of the single cells to strain TDC60, are summarized in Table 1. 
Coverage of the reference genome varied with 41%, 87%, and 
91% for MDA 1, 2, and 3, respectively. As with many MDA-de- 
rived single-cell genomes, the coverage depth varied widely 
across the genome: The average coverage for MDA3 was 23 7 X, 
with —90% of the genome covered at 10 x or greater (Supple- 
mental Fig. S3). Single nucleotide polymorphism (SNP) and 
deletion insertion and polymorphism (DIP) analyses revealed 
847 shared SNPs between the three amplified genomes with re- 
spect to TDC60, with 791 in coding regions (202 missense) 
(Supplemental Fig. S4). There were 75 shared DIPs in total. Al- 
though MDA1 had the highest number of reads, it had the lowest 
reference genome coverage. Mapping of reads from all three 
MDAs did not increase the percentage of the reference genome 
mapped compared with coverage by MDA3 alone. 

De novo assembly of P. gingivalis genomes 

All three MDA sequence data sets were assembled de novo. Our 
assembly analysis described here is restricted to MDA3, which 
corresponds to the MDA having the highest genome coverage to 
TDC60. Two newly developed de novo assembly tools specifically 
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Table 1 . Individual amplified single cells and combined single-cell 
read mapping against P. gingivalis TDC60 reference 





MDA1 


MDA2 


MDA3 


MDA123 


Total read count 


8,873,267 


990,119 


5,756,473 


15,619,859 


Percent of reference 


40 


87 


91 


91 


mapped (%) 










Maximum coverage 3 


20,717 


1074 


10,059 


21,726 


Average coverage 6 


366 


41 


238 


644 



a The highest coverage in the region. 

b Average coverage is calculated by summing up the bases of the aligned 
part of all the reads divided by the length of the reference sequence in- 
cluding zero coverage regions. 



designed to assemble reads generated from single-cell MDA re- 
actions, Euler correction with Velvet-Single Cell E+V-SC (Chitsaz 
et al. 2011) and SPAdes (Bankevich et al. 2012), were compared 
(Table 2) with the Velvet assembly tool (Zerbino and Birney 2008). 
For single-cell projects, these significantly improve on traditional 
assemblers (designed for mono-species cultured samples), since 
they are able to cope with the wide variations in coverage and el- 
evated numbers of chimeric reads and read pairs characteristic of 
MDA reactions. Completeness of the assemblies was assessed by 
using Plantagora (Barthelson et al. 2011) to compare with the 
reference. Plantagora determines "misassembly breakpoints" with 
the assumption that the data set being assembled is a sample of the 
reference genome. It is emphasized here that since contigs are 
compared against a reference from a similar but nonidentical 
strain, the term "misassembly breakpoints" does not apply and the 
term "breakpoints versus TDC60" is used instead. Assemblers were 
compared based on the fraction of the genome covered by the 
assembled contigs, and an adjusted N 50 , in which assembly contigs 
are broken into multiple contigs at the breakpoints determined by 
Plantagora. Sequence in the contigs that 
does not align to the TDC60 reference is 
not counted. Note that the adjusted N 50 is 
likely to eliminate misassemblies from 
the N 50 computation (which result in 
incorrectly overestimating N 50 ), but this 
is at the expense of underestimating the 
true N 50 . While assembly of sample 
MDA3 with Velvet resulted in only 52% 
coverage of the TDC60 reference genome 
with a very small adjusted N 50 (Table 2), 
assemblies with Velvet-SC and SPAdes 
resulted in 88% and 90% coverage, re- 
spectively, which are close to the 91% 
coverage by reads (Table 1). The adjusted 
N 50 was 10,732 for Velvet-SC and 13,589 
for SPAdes. For MDA2, SPAdes yielded 
a much better assembly than Velvet-SC: 
58% coverage of TDC60 for Velvet-SC 
versus 78% for SPAdes (while the reads 
have 87% coverage) (see Table 1), and 
adjusted N 50 of 1495 for Velvet-SC versus 
5887 for SPAdes. Based on this bench- 
marking, SPAdes assemblies only were 
used for further analyses. 

Following assembly, contigs result- 
ing from nontarget environmental se- 
quences, as well as MDA contaminants, 
were identified and removed from the 



three MDA assemblies using a combination of BLAST and the Au- 
tomated Phylogenetic Inference System (APIS) (Badger et al. 2006), 
as described previously (Chitsaz et al. 2011; Dupont et al. 2012). 
APIS performs a BLAST analysis of each open reading frame (ORF) 
against reference genomes and, when possible, generates a phylo- 
genetic tree for each ORF. APIS analysis confirmed that the ma- 
jority of ORFs on the contigs were placed within the genera Por- 
phyromonas. APIS also helped to identify contaminant contigs as 
those containing ORFs that were phylogenetically similar to each 
other and distinct from the rest of the assembly (Chitsaz et al. 
2011; Dupont et al. 2012). We also assessed the relative proportion 
and phylogenetic association of target and nontarget contigs by 
MGTAXA (http://mgtaxa.jcvi.org) (Supplemental Fig. S5), which 
performs taxonomic classification of metagenomic sequences with 
machine learning techniques. This process also confirmed that the 
majority of contigs for MDA3 belonged to P. gingivalis. 

We investigated the three assembled genomes by remapping 
the three sets of MDA reads to the final contigs generated by 
MDA3. We reported SNPs using the same criteria as described 
above, revealing an insignificant number of SNPs in a few genes 
between the data sets (Supplemental Table S2). We also confirmed 
that the three shared identical 16S rRNA gene sequences based on 
938 bp of alignment from initial PCR-amplified 16S rRNA gene 
results as well as identical full-length 16S rRNA gene sequences in 
the three assemblies. Although the data may provide support that 
these three cells are likely the same strain and the assemblies rep- 
resent the same genome, it is not conclusive, so we chose to treat 
them as separate genomes rather than combine reads to attempt to 
generate a better assembly. Given the fact that MDA3 had the 
highest mapped coverage of the reference genome and the largest 
number of contigs classified as P. gingivalis, we further restricted 
our analyses to MDA3. In total, 288 contigs from MDA3 were 
chosen for further annotation and genomic comparisons. 



Table 2. Comparison of assemblies of single-cell P. gingivalis MDA3 



Assembly 


SPAdes 


E+V-SC 


Velvet 


Number of contigs 3 


614 


452 


3162 


Total length 


2,537,623 b 


2,352,771 


1,442,312 


Largest contig 

N 50 c 


101,845 


82,01 7 


3788 


23,369 


13,391 


283 


Adjusted N 50 d 


13,589 


10,732 


220 


Reference length (strain TDC60) 


2,339,898 


2,339,898 


2,339,898 


Number of breakpoints vs. TDC60 


115 


96 


7 


Number of contigs with breakpoints 


65 


55 


5 


Number of bases in contigs with breakpoints 


1,235,715 


866,637 


6678 


Number of unaligned contigs 6 


351 + 23 part 


91 + 40 part 


493 + 74 part 


Number of unaligned bases 


194,790 


135,565 


205,376 


Average identity (%) f 


96.720 


96.830 


98.860 


Mapped genome (%) g 


90.139 


87.682 


52.175 



Quantities in the table are based on the contigs of size at least 201 bases. 

b The best value in each category is boldfaced when possible; some categories cannot be interpreted as 
having a best value. 

C N 50 is the largest contig length, L, such that contigs of size >L comprise at least half of the bases in the 
reference genome (TDC60). 

d Adjusted N 50 is computed by first aligning the contigs to the TDC60 reference genome, removing the 
nonaligning parts, and breaking the contigs up into blocks at alignment boundaries. With an exact 
reference genome, this would prevent misassemblies (incorrect contig joins) and contaminants from 
inflating the value of N 50 ; however, there may be true differences between this strain and TDC60, which 
this adjustment penalizes. The remaining statistics are also based on the same alignments and alignment 
breakpoints. 

e A contig is partially aligned ("part") when it has alignment(s) to the reference genome, but they 
comprise <99% of the contig's length. 

f The rate of matches in the portions of the contigs that align to the TDC60 reference genome. 
g The fraction of the TDC60 reference genome to which contigs are aligned. 
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General genomic features 

The single-cell genome assembled from MDA3, designated PG 
JCVI SC001, was 2.35 Mb in size and is similar to the other three 
P. gingivalis genomes for a number of genome parameters (Table 
3). A total of 1869 genes (86%) were found to have some level of 
homology in ATCC33277, W83, or TDC60 genomes (Supple- 
mental Table S3), with a total of 1500 genes making up the pan- 
genome encompassing all three genomes (Supplemental Fig. S6). 
The 524 genes unique to PG JCVI SC001, i.e., no ortholog to 
reference genomes, were primarily annotated as hypothetical 
proteins (Table 4; see full table in Supplemental Table S4). Contigs 
were then ordered based on the TDC60 reference genome and 
concatenated to provide a scaffold for further comparative ge- 
nomics. The circular representation of the genome (Fig. 2) dis- 
plays the ordered contigs and predicted CDSs, as well as BLASTN 
analyses to the three P. gingivalis genomes and a distant relative, 
Prevotella buccae ATCC33574. 



Multiple locus sequence typing 

We used the seven gene multiple locus sequence typing (MLST) 
scheme for P. gingivalis (Jolley et al. 2004; Enersen et al. 2008a) 
and sequence types (STs) from human periodontitis isolates of 
P. gingivalis in the MLST database (http://www.pubmlst.org/ 
pgingivalis) (Jolley et al. 2004) to type the PG JCVI SC001 ge- 
nome. MLST detects allelic variation at multiple housekeeping 
loci accumulating slowly in bacterial populations. This database 
has been used to investigate the clonality of P. gingivalis, which 
was previously reported to have a weakly clonal population 
structure comparable with Neisseria menigitidis (Enersen 2011). 
Using the PG JCVI SC001 contig sequences as input, the MLST 
database searches revealed that the sequenced single cell has 
five exact matches to previously sequenced genes of the seven 
MLST loci (ftsQj gpdxj, hagB, mcmA, pepO, pga, recA). Strain 
TDC60 also had only five exact matches to the database, in- 
dicating that they have a unique allele pattern for hagB (SC001), 
gpdxj (TDC60), and pga (both) absent in the database. Using 
existing MLST database tools that generate trees based on the 
allelic profiles of the 138 sequence types (http://www.pubmlst. 
org), we confirmed that the nearest sequence type to PG JCVI 
SC001 is ST-68, having three identical matches (pepO, pga, and 
recA) (Supplemental Fig. S7). 

Analysis of virulence factors 

Diversity in P. gingivalis is reported to arise by genetic recom- 
bination rather than mutation (Frandsen et al. 2001; Koehler et al. 



Table 4. JCVI SCOOI specific CDS (top 12 of 524) identified via 
reciprocal best BLAST analysis 



Raw count 


% a 


Annotation 




40. OO 


Hypothetical protein 


91 


17.27 


Conserved hypothetical protein 


31 


5.88 


Conserved domain protein 


4 


0.76 


Cleaved adhesin domain protein 


4 


0.76 


Site-specific recombinase, phage integrase family 


3 


0.57 


PcfJ-like protein 


3 


0.57 


Peptidase, S9A/B/C family 


3 


0.57 


TraM recognition site of TraD and TraG 


3 


0.57 


Transposase, IS4-like family protein 


3 


0.57 


Lipoprotein, putative 


3 


0.57 


DNA-binding helix-turn-helix protein 


3 


0.57 


Glycosyltransferase, group 1 family protein 



Percentage of CDS specific to JCVI SC001 . 



2003; Nadkarni et al. 2004). In addition to the MLST data, the 
structure and function of major virulence factors, such as fim- 
briae (fimA) and the capsular polysaccharide biosynthesis locus 
(CPS), provide insight into strain variation and degrees of path- 
ogenicity (Lamont and Jenkinson 1998). The single-cell read data 
and de novo-assembled contigs provide the opportunity to ex- 
amine strain variation and pathogenicity of this uncultivated 
strain. In some cases, mapping reads to the reference TD60 ge- 
nome provided deep coverage sufficient to detect SNPs, whereas 
in other highly variable regions, such as the CPS region, reads 
could not be accurately mapped. The CPS region could, however, 
be recovered in the assembled contigs and used for comparative 
genome analyses. 

Coverage and analysis of fimbriae gene 

It is becoming evident that the fimbriae A gene {fimA), encoding 
the major fimbrial subunit of P. gingivalis, is one of the main vir- 
ulent factors of this organism. Based on fimA, P gingivalis is clas- 
sified into six genotypes (genotype I, lb, II, III, IV, and V). Epide- 
miological studies have shown that advanced periodontitis 
patients harbor fimA type II (Enersen et al. 2008a,b; Enersen 201 1). 
It should be noted that fimA type II P. gingivalis is most frequently 
detected in cardiovascular disease patients (Nakagawa et al. 2006) 
and has been shown to invade epithelial cells (Nakagawa et al. 
2006). TDC60 has type II fimbriae, and our analysis of read map- 
ping and de novo assemblies confirms that our single-cell fimA 
sequences are related to the fimA of TDC60. A total of six SNPS in 
fimA are shared by all three MDAs with >10x coverage (Fig. 3), 
giving further confidence that these six SNPs are valid and not 
assembly error or MDA artifacts. 



Table 3. General features of the PG JCVI SC001 genome and comparisons with sequenced 
P. gingivalis genomes 





CRISPR 




Coding base 


Genome 


Gene 


CDS 




RNA 


Strain 


count 


GC% 


count 


size 


count 


count 


CDS % 


count 


ATCC 


3 


48 


2,046,172 


2,354,886 


2155 


2090 


96.98 


65 


W83 


4 


48 


1,954,527 


2,343,476 


1984 


1909 


96.22 


75 


TDC60 


5 


48 


2,040,041 


2,339,898 


2283 


2217 


97.11 


66 


JCVI SC001 a 


1 


48 


2,0327,27 


2,350,571 


2344 


2293 


97.88 


45 


JCVI SC001 b 


1 


48 


1,948,482 


2,350,571 


2165 


2120 


97.98 


48 



a Based on Glimmer gene prediction and JCVI prokaryotic annotation pipeline. 

b Based on XBase (http://www.xbase.ac.uk/annotation/) gene annotation (using P. gingivalis W83 as 
a reference). 



Variation in the capsular polysaccharide 
biosynthesis locus 

The CPS locus has been described as 
a virulence factor of various pathogenic 
bacteria involved in evasion of the host 
immune system. Encapsulated P. gingivalis 
strains such as W83 have been shown to be 
more virulent than nonencapsulated strains 
(e.g., ATCC 33277 in the mouse infection 
model) (Laine and van Winkelhoff 1998; 
Brunner et al. 2010a,b). The CPS loci were 
poorly covered by reads of all three MDA 
data sets, suggesting that these may be 
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region is bounded by shared homology 
in the gene encoding glycosyl trans- 
ferase, group 4 family protein, and epsC 
(UDP-N-acetyl-D-mannosaminuronic acid 
dehydrogenase) upstream and a pair 
of downstream genes encoding UDP- 
N-acetylglucosamine 2-epimerase (epsD) 
and DNA-binding protein HU. The syn- 
teny of genes in this region was more 
similar between the virulent TDC60 
(Watanabe et al. 2011) and the less viru- 
lent ATCC 33277 than between the two 
more virulent strains W83 and TDC60. 
Our MDA3 assembly has only five ORFs 
between epsD and epsC and appears very 
distinct compared with the reference se- 
quences. A recent report has shown that 
a loss in the ability to produce a capsule, 
by deletion of the glycosyl transferase, 
group 4 family protein, increases biofilm 
formation by W83 and ATCC 33277. Al- 
though we do not have an isolate to test if 
these single cells would produce a cap- 
sule, one may speculate based on the ge- 
nomic data that PG JCVI SC001 lacks a 
capsule. 

Evidence of unique CRISPR region 



Figure 2. Circular representation of the draft JCVI SC001 genome. The assembled draft genome is 
the SPAdes assembly of MDA3 with the contigs ordered to the TDC60 reference genome. From the 
inner to the outer ring: coordinates in the assembled and concatenated JCVI SC001 genome, G+C 
content, GCskew, ordered contigs, predicted CDS, TBLASTN alignment showing percent identity 
against P. gingivalis TDC60, W83, ATCC 33277, and Prevotella buccae ATCC33574 (near neighbor) 
reference genomes. 



highly variable regions within the genome. The arrangement 
of the genes between the three sequenced genomes and JCVI 
SC001 single-cell genome from MDA3 (Fig. 4) reveals how the 



Differences between species can be ob- 
served in clustered regularly interspaced 
short palindromic repeats (CRISPR) se- 
quences, which are noncontinuous direct 
repeats separated by variable (spacer) se- 
quences that have been shown to confer 
immunity to phage (Jansen et al. 2002; Barrangou et al. 2007). 
Amplification and de novo assembly was successful for the pre- 
viously identified CRISPR (designated 36-30) within W83; this 
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Figure 3. Single nucleotide polymorphisms and read coverage across fimA of the reference strain TDC60. (Row 1) Reference gene fimA; (row 2) shared 
SNPs across the three single-cell genomes with 1 00% frequency at a coverage of 1 0X; (row 3) shared SNPs at a coverage of 30X; (rows 4-8) SNPs at 1 0x 
and 30x for each single-cell amplification; (rows 9-1 0) mapped deletions; (rows 11-13) mapped reads; (row 14) mapped 454 reads. 
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NC_002950_W83 



NC 010729 ATCC 33277 



NC 015571 TDC60 



MDA3 




Figure 4. Comparison of the polysaccharide capsule locus found in MDA3 (bottom) with W83, ATCC 33277, and TDC60. Genes of the same color are 
from the same orthologous group. 



CRISPR contains eight repeats of 36 bp. Comparisons of this region 
with other genomes (Fig. 5) reveal identical repeat sequences be- 
tween all genomes. This region, however, varies in the number of 
repeats, number of spacer sequences, and spacer identity. Addi- 
tional confirmed CRISPR repeat sequences of 46 bp (five repeats 
and four spacers) as well as putative CRISPR direct repeat se- 
quences of 30, 38, 45, and 52 bp, and associated spacers were 
found in PG JCVI SC001 using the CRIPSR finder database 
(http://crispr.u-psud.fr/crispr/). The 36-30 region is not flanked 
by CRISPR-associated (cas) genes in W83 or TDC60, and all three 
MDA assemblies also failed to capture any cas genes or any ho- 
mologous genes related to the cas system. Zegans et al. (2009) 
noted that loss or disruption of five of the six cas genes results in 
a restoration of biofilm formation in Pseudomonas aeruginosa 
strains that were infected with a lysogenic phage. It is interesting 
to speculate that a lack of cas genes in PG JCVI SC001 may also 
provide an advantage for this strain to 
integrate into a biofilm community, al- 
though no prophage regions were de- 
tected in the MDA products. 



W83 



covery of genomes of bacterial pathogens from single cells out 
of an environmental sample. We demonstrate that a single-cell 
approach enables analysis of the genetic diversity between the 
captured environmental cells and sequenced pathogen genomes, 
permits identification of variations in virulence factors, and sup- 
ports discovery of variant genes in the genome. Single-cell geno- 
mics comparing multiple single cells enables analysis of genetic 
diversity within a population and, although it has inherent biases 
of its own, may potentially be free from biasing effects that can 
occur during subculturing, such as gene loss (Karen et al. 1992; Nair 
et al. 2004). Based on the work reported here, capturing ge- 
nomes from environmental samples using single-cell approaches 
could support studies on the prevalence and genotype of path- 
ogens from environmental sources and may ultimately help re- 
veal their possible modes of transmission between the host and 
environment. 



Discussion 

A vast majority of bacteria in the en- 
vironment as well as those associated 
with the human microbiome have eluded 
standard culturing approaches, and there- 
fore their physiology and their gene con- 
tent are unknown. This leaves a large gap 
in our knowledge of the potential roles 
for these organisms in the environment 
and also in human health and disease. 
This is the first report describing the re- 
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Figure 5. Comparison of a clustered regularly interspaced short palindromic repeat region (CRISPR). 
Successful multiple displacement amplification and de novo assembly of the repeats in CRISPR region 
36-30. This region was first identified in strain W83, and all three genomes have 1 00% identical repeat 
sequences. The regions vary in the number of repeats, number of spacer sequences, and spacer 
identity. 
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Methods 

Sample collection and isolation of bacterial cells from sink 
material 

Sink drain samples were collected with sterile cotton-tipped swabs 
from a publicly accessible restroom adjacent to an emergency 
waiting room within a Medical School Hospital (University of 
California, San Diego, Medical Center, Hillcrest, San Diego, CA, 
USA). The initial sample was fixed with ethanol and vortexed 
briefly for 20 sec (Supplemental Fig. SI). The sample was filtered 
through a 35-|xm filter. A 2-mL cushion of prechilled Nycodenz 
gradient solution (Nycoprep Universal, Axis Shield) was placed in 
a 1 7-mL ultracentrifuge tube, and 6 mL of supernatant was placed 
gently over the Nycodenz cushion. The pair of balanced tubes was 
centrifuged at 9000 rpm for 20 min at 4°C in an ultracentrifuge 
SW32.1 rotor. The visible cloudy interface containing the bacterial 
cells was collected gently, mixed by inversion to create a suspen- 
sion, and diluted 1000-fold for flow cytometry. For further details, 
see the Supplemental Methods. 

FACS detection and single-cell sorting 

FACS detection was performed on the Nycodenz fractionated 
bacteria. Filter-sterilized (0.2 |xm) phosphate-buffered saline (PBS, 
lx) was used as sheath fluid and for sample dilution. Unstained 
and SYBR Green I (0.5 X) stained material was 35-|xm filtered, and 
a 1:1000 dilution was assessed for event rate at low flow rate (<2000 
total events/sec) and adjusted if necessary. The low flow rate is 
critical to reduce the likelihood of sorting coincident events. 
Fluorescent events were collected using gating parameters FSC- 
PMT versus SSC and FSC-PMT versus SYBR Green (Supplemental 
Fig. SI). One thousand events from each gate were sorted at a lower 
purity setting onto glass slides for viewing with fluorescent mi- 
croscopy on an Olympus IX 70 inverted fluorescence microscope at 
60 X magnification to confirm the presence of cellular morphol- 
ogies. A bead targeting strategy (see Supplemental Methods) was 
used to ensure that only single cells were sorted from the sample 
into each well of a 384-well PCR plate (FrameStar). Single-cell 
events were sorted into 4 jjlL of a low EDTATE (10 mM Tris, 0.1 mM 
EDTA at pH 8.0) and immediately frozen on dry ice and held there 
until transfer to -80°C for storage prior to processing. For further 
details, see the Supplemental Methods. 

Multiple displacement amplification 

MDA of single-cell genomes was performed in a 384-well format 
using the GenomiPhi HY kit (GE Healthcare) with a custom Agilent 
BioCel robotic system (see detailed schematic in Supplemental Fig. 
S2). Briefly, cells were lysed by addition of 2 jjlL of alkaline lysis 
solution (645 mM KOH, 265 mM DTT, 2.65 mM EDTA at pH 8.0), 
then incubated for 10 min at 4°C. After lysis, 7 |xL of a neutraliza- 
tion solution (2.8 pL of 1290 mM Tris-Cl at pH 4.5 and 4.2 |xL of GE 
Sample Buffer) was added, followed by 12 |xL of GenomiPhi master 
mix (10.8 jxL of GE Reaction Buffer and 1.2 (jlL of GE Enzyme Mix) 
for a reaction volume of 25 |xL. Reactions were incubated for 16 h at 
30°C, followed by a 10-min inactivation step at 65°C. MDA yield 
was determined by Picogreen assay. Three hundred eighty-four no 
template control (NTC) MDA reactions were included to reveal any 
contaminating sequences and processed in parallel through 16S 
PCR analysis. These negative controls lacking a sorted cell were run 
in parallel to determine the relative amount and identity of con- 
taminating bacterial DNA in the MDA reagents, a necessary stan- 
dard practice in single-cell genomics due to the highly processive 
strand displacement activity of the phi29 DNA polymerase (Allen 



et al. 201 1; Blainey and Quake 201 1; Woyke et al. 201 1). For further 
details, see the Supplemental Methods. 

PCR and analysis of 16S rRNA genes 

16S rRNA genes were amplified from diluted MDA product using 
universal bacterial primers 27f and 1492r (Lane 1991) according to 
previously established protocols (Chitsaz et al. 2011; Eloe et al. 
2011; Dupont et al. 2012) (Supplemental Methods). BLASTN 
analysis against the SILVA SSU ref NR 102 database (Pruesse et al. 
2007) was performed to classify the 16S rRNA gene sequences 
taxonomically and to determine their relationship to sequenced 
bacterial genomes and 16S rRNA gene sequences. An additional 
BLASTN analysis was performed against a curated database of near 
full-length to full-length 16S sequences (at least 900 bp) from human 
fecal sample 16S rRNA sequences (at least 900 bp) from several 
survey studies (Eckburg et al. 2005; Gill et al. 2006; Dethlefsen et al. 
2008; Tap et al. 2009; Turnbaugh et al. 2009) as well as 16S rRNA 
sequences from the Human Oral Microbiome Database (HOMD). 
In both cases, sequences were clustered by cd-hit at the 99% level 
to remove redundancy. This reduced the combined data set of 
57,894 sequences to 28,335 OTU representatives at 99% identity. 
Because the survey sequences had no taxonomy assigned to them, 
they were classified using the SILVA taxonomy via the classifica- 
tion feature of mothur (Schloss et al. 2009). MDA products with 
16S rRNA gene taxonomy similar to those NTC reactions were 
excluded from further analysis. For further details, see the Sup- 
plemental Methods. 

Sequencing 

Nextera transposition-based 454 compatible fragment libraries were 
constructed for each MDA as per the manufacturer's instructions 
(Epicentre Technologies) using MDA genomic DNA as template, 
including incorporation of the 48 Nextera 454 barcodes and use of 
Zymo DNA Clean & Concentrator-5 columns (Zymo Research) fol- 
lowing the transposition and PCR amplification steps. Nextera 
transposition-based Illumina fragment libraries were constructed 
for each MDA as per the manufacturer's instructions (Epicentre 
Technologies). The library concentrations were determined by ab- 
sorbance on a NanoDrop spectrophotometer. The column-purified 
barcoded fragment libraries were pooled at —1:1, without size se- 
lection, and emulsion PCR and sequencing were performed at JTC. 
Illumina sequencing (Bentley 2006) was performed on the MDAs 
using the Genome Analyzer II System according to the manufac- 
turer's specifications. The three MDAs assigned to P. gingivalis were 
barcoded and pooled on a half-lane; this generated 11.5 GB of data 
and 47 million reads that passed quality score >20. 

Reference mapping 

Reference mapping was conducted using the CLC Genomics 
Workbench, with the reference?, gingivalis (NC-015571). Mapping 
parameters were as follows: local alignment with mismatch cost 2, 
insertion cost 3, deletion cost 3, length fraction 0.9, and similarity 
0.9 (90% of the read length needed to be aligned at 90% similarity). 

Single nucleotide polymorphism (SNP) and deletion insertion 
and polymorphism [DIP) analysis 

Parameters for SNP analysis using the CLC Genomics Workbench 
were max # of gaps and mismatches 2, minimum average of quality 
of surrounding bases 30, minimum quality of central base 30, 
minimum coverage 10, minimum paired coverage = 10, minimum 
variant frequency 80%. DIP analysis parameters using the CLC 
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Genomics Workbench were minimum coverage 4, minimum var- 
iant frequency 35%. Coverage above 30 X with a minimum count 
of 10 reads with a frequency of 80% cutoff was considered for the 
detection of shared SNPs. 

Single-cell assemblies 

Assemblies were produced using Velvet-SC (Chitsaz et al. 2011) 
and SPAdes (Bankevich et al. 2012). Both of these assemblers are 
based on the de Bruijn graph, and both have been adapted for 
uneven coverage found in single-cell MDA data sets. SPAdes was 
also adapted for the elevated numbers of chimeric reads and read 
pairs in MDA-amplified data sets. For Velvet-SC, we assembled the 
data with k-mer size k = 55. For SPAdes, we iterated over k-mer sizes 
k = 21, 33, and 55. As described in Results, we selected the SPAdes 
assembly of MDA3 for detailed analysis. 

Contig analyses 

Assemblies were manually curated using a conservative approach 
for single-cell MDA data as described in Chitsaz et al. (2011), Eloe 
et al. (2011), and Dupont et al. (2012). Briefly, all contigs <150 bp 
in length were removed. Taxonomic affiliations of the predicted 
protein sequences on the contigs were assigned using APIS (Badger 
et al. 2006), and contigs that contained a majority of proteins with 
taxonomic affiliations other than the genus-level classification 
Porphyromonas were removed. Further verification of correct tax- 
onomic classification of the contigs was performed using MGTAXA 
software, which performs taxonomic classification of metagenomic 
sequences and is fundamentally based on the frequency of /c-mers 
(http://mgtaxa.jcvi.org). Although the use of combined approaches 
for enrichment of target contigs of interest can remove a fraction of 
potentially new genomic information, it also provided confidence in 
the final data sets since taxonomic affiliation of the sequences are in 
accordance with the 16S rRNA phylogeny. 

Gene annotation 

The assembly from MDA3 that represented most of the genome was 
annotated using the JCVI prokaryotic annotation pipeline (http:// 
www.jcvi.org/cms/research/projects/annotation-service/overview/). 
In addition, the XBASE rapid annotation service (http://www.xbase. 
ac.uk/annotation/) was used to annotate homologs in this genome to 
P. gingivalis strain W83, which XBASE provides as the nearest ref- 
erence genome for the purpose of rapid annotation. 

Orthology 

Reciprocal best BLASTp analysis of genes for reference genomes 
was P. gingivalis TDC60, ATCC33277, W83, and predicted genes of 
PG JCVI SCOOl at a cutoff of 1 X 10" 9 . Annotations for genes that 
were found only in PG JCVI SC001 (524, 22.8%) were from the 
JCVI prokaryotic annotation pipeline. 

Multi Locus Sequence Typing 

This publication made use of the P. gingivalis Multi Locus Sequence 
Typing website (http://pubmlst.org/pgingivalis/) developed by 
Keith Jolley and sited at the University of Oxford. 

Data access 

The genome data have been submitted to the NCBI GenBank 
(http://www.ncbi.nlm.nih.gov/genbank/) under accession num- 
ber PRJNA167667 ID:167667. 
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